fix(controller): prevent plan step drift and reduce VLM false negatives#97
Conversation
abrichr
left a comment
There was a problem hiding this comment.
Review: Fixes for PR #97 feedback items
Pushed commit 422fd56 addressing the three feedback items.
1. Heuristic/verifier share no state so steps can drift
Root cause found: The PR already added _external_step_control to disable the agent heuristic when the controller is driving. However, the agent still injected its own stale plan progress text into Claude API messages from _plan_steps (which never gets updated under external control). Claude received two conflicting views: controller says step 3, agent says step 1.
Additionally, the agent done-override logic used stale _plan_steps even though the controller already handles done-override.
Fix applied (claude_computer_use_agent.py):
- _build_initial_messages(): Skips plan progress injection when _external_step_control=True
- Follow-up messages: Skips plan progress / demo re-injection under external control
- _process_response(): Skips premature done override when under external control
2. partially_verified leaves no trace for goal verification
Already well-addressed in the PR. The _build_step_verification_summary() correctly detects partially_verified steps, includes explanations, and augments goal text. No additional changes needed.
3. Regression tests mock only VLM parsing (no integration tests)
Added TestAgentPlanProgressSuppression class with 4 integration tests:
- Agent suppresses stale progress under external control (real agent, not mock)
- Agent injects progress normally without external control
- End-to-end: controller sets flag preventing stale progress
- Done override handled by controller not agent
All 162 tests pass (0 failures, 1 skip for missing demo file).
… dep - Add comment in reset() explaining why _external_step_control is not reset - Add comment on hasattr guard explaining MagicMock behavior is acceptable - Add docstring note in TestFalseNegativeRegressions about VLM response limitation - Add flask to test optional-dependencies for CI coverage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two improvements to the closed-loop demo-conditioned controller: 1. Plan step tracking drift prevention: _advance_plan_steps() now only compares current step vs next step, advancing at most one step per call. Previously, bulk keyword matching could jump 5+ steps on a single action. 2. VLM verification prompt tuning: Added "partially_verified" status for cases where the core outcome is achieved but with minor deviations (cursor position, formatting). Rewrote all verification prompts to be outcome-focused, reducing false negatives from live eval scenarios. Adds 68 new tests (8 drift prevention + 21 VLM prompt + 9 false-negative regressions + 30 existing test updates). All 147 controller tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Analyzes unit economics of the closed-loop controller architecture: Claude agent costs, VLM verifier costs, scaling projections, and a three-phase strategy from loop-as-product to trained-model-as-product. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pyautogui.drag() uses relative coordinates that compound with starting position errors, making it unreliable for small targets like LibreOffice fill handles (~3x3 pixels). Replace with explicit mouseDown/moveTo/mouseUp sequence with timing delays for reliable drag operations. Also adds drag case to _build_pixel_command() for the pixel_action() path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…al verification Three issues addressed: 1. Heuristic/verifier step drift: The agent's keyword-based _advance_plan_steps() heuristic and the DemoController's VLM verifier operated on independent state, allowing them to disagree on which step was current. Fix: add _external_step_control flag to the agent that the DemoController sets at init, making _advance_plan_steps() a no-op when the controller manages step progression via VLM verification. 2. partially_verified invisible to goal verification: When steps were marked partially_verified, the final goal verification pass had no visibility into which steps had partial completions. Fix: _verify_goal() now builds a step verification summary and augments the goal text with it when noteworthy statuses (partially_verified, failed) exist. 3. Missing integration tests: Added TestHeuristicVerifierSync (4 tests) and TestGoalVerificationContext (5 tests) that verify the heuristic is properly disabled under controller management, step advancement is driven by VLM verification, and partial/failed step context reaches goal verification. Also added 2 agent-level tests for _external_step_control behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When DemoController sets _external_step_control=True, the agent's internal plan progress injection and done-override logic now become no-ops. This prevents the agent from sending conflicting step-tracking signals to the Claude model (agent says "step 1 in progress" while controller says "step 3 is current"). Three specific suppressions: 1. _build_initial_messages skips plan progress text injection 2. Follow-up messages skip plan progress / demo re-injection 3. Premature "done" override is left to the controller Adds integration tests exercising agent+controller interaction: - Agent suppresses progress under external control - Agent injects progress normally without external control - Controller's augmented task instruction reaches the agent - Done override handled by controller, not agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After WAA setup (close_all → verify_apps → download → open), the target application may be behind other windows, still loading, or obscured by notifications. This wastes 6+ agent steps recovering. Add _ensure_app_focused() with multi-strategy approach: - Maps task related_apps to window title patterns - Uses WAA /setup/activate_window endpoint (same as WAA postconfig) - Falls back to Alt+Tab - Retries 3x with increasing delays (2s, 3s, 5s) - Verifies foreground window title via pygetwindow on VM - Runs during reset(), does NOT count against agent step budget Also adds _APP_WINDOW_PATTERNS mapping, _get_expected_window_patterns(), _check_foreground_matches(), and _normalize_app_name() helpers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive analysis of GUI agent failure modes with taxonomy, recording system design, training viability assessment, and prioritized action plan. Key findings: - 4-category taxonomy: Environment, Agent Planning, Grounding, Verifier - Existing ExecutionTraceCollector needs only minor extensions - SFT on 50-100 corrected trajectories expected 10-30pp improvement - Deterministic infrastructure fixes should come first (Tier 1) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… dep - Add comment in reset() explaining why _external_step_control is not reset - Add comment on hasattr guard explaining MagicMock behavior is acceptable - Add docstring note in TestFalseNegativeRegressions about VLM response limitation - Add flask to test optional-dependencies for CI coverage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ded41c8 to
fd5e0ae
Compare
Summary
_advance_plan_steps()to only compare current step vs next step, advancing at most one step per call. Previously, aggressive keyword matching could jump 5+ steps on a single action, causing the controller to skip intermediate steps entirely.partially_verifiedstatus andeffectively_verifiedproperty for cases where the core outcome is achieved but with minor deviations (cursor position, formatting). Rewrote all verification prompts (step, progress, goal) to be outcome-focused, reducing false negatives observed in live eval.Changes
claude_computer_use_agent.py_advance_plan_steps()limits to single-step advancementplan_verify.pypartially_verifiedstatus,effectively_verifiedproperty, outcome-focused promptsdemo_controller.pyeffectively_verifiedinstead of== "verified"test_claude_computer_use_agent.pytest_demo_controller.pytest_plan_verify.pyTest plan
🤖 Generated with Claude Code